ABSTRACT: We study some features of learning models based on "delayed"
and undifferentiated reinforcement. We concentrate on models based on 
primitive algorithms without sophisticated structure and which might be 
considered as very elementary. 

1 Introduction 

Introducing biologically motivated features in models for learning usually has
a double role: testing hypotheses for natural learning and finding hints for
artificial learning. Here we do not take the more ambitious point of view of
finding optimal algorithms for the latter. On the contrary, our motivation is
to investigate what the capabilities of very elementary mechanisms are. We
shall consider a number of learning aspects which may appear biologically
motivated (by which we mean primarily the functional aspects). Specifically,
we shall consider learning models, called here for short "stochastic learning",
as involving the following features:

1. implementation achieved by acting stochastically in a structured envi- 
ronment; 

2. control by non-specific reinforcement, that is, the reinforcement is given 
according to the success of series of actions; 

3. reinforcement acting via the normal, internal activity of the system 
(requires no "external computation"). 

The first point is essential for the perspective taken here, which sees
learning as the result of the interaction between an outward action,
spontaneous and random in its basis, and an inward feedback reflecting the
environmental conditions. The second point represents the normal situation
for this problem setting, since the agent usually will only find out at the
end of the day whether it is successful or not (e.g., survives or not), but
will not be told how good or bad each of its steps was. The third point is
somewhat subtle; essentially it suggests that no algorithm should be used
which is of a higher level than that defining the agent.

From the point of view of reinforcement learning our problem may be 
seen under "class III" problems in the classification of Hertz et al. (1991). 
However, we stress that our attitude is not that of finding good algorithms 
for tackling special problems, like movement, control or games - see, e.g., 
Kaelbling (1996). Instead we want to test whether under these quite restric- 
tive conditions the internal structure of the agent will become differentiated 
enough to meet the conditions of the environment. For this reason we do not
consider evolved algorithms from the class of Q-learning (Watkins 1989), of
TD learning (Sutton 1988), agent and critic (Barto, Sutton and Anderson
1983), etc., but restrict ourselves to the most primitive algorithms, which
we may think of as having had a chance to develop under natural conditions.
On the other hand, if such algorithms prove capable of tackling the problem,
they may well give further insights.

An illustration of the problem was provided in an earlier paper (Mlodi- 
now and Stamatescu 1985) dealing with these questions in the simulation of 
a device moving on a board. The device realizes a biased random walk, the 
bias coming from trying to recognize situations and considering the "good- 
ness" of the previously taken, corresponding moves. There is absolutely no 
structure presupposed in the behavior of the device, beyond the sheer urge 
to move (completely at random to start with), the structure is fully hid- 
den in the environment - according to point 1. above. The reinforcement 
(positive or negative) itself is global: it is associated with the results of
long chains of moves. It is assigned indiscriminately to all moves in the
chain to modify their "degree of goodness" (see point 2.). Under these
conditions the device shows a number of interesting, quantifiable features:
"flexible stability" (its behavior fluctuates around a solution - path to the
goal - without losing it, unless as a result of the fluctuations a better
solution is found);
"development" (in the course of the training on harder and harder problems, 
solutions to the simpler problems are taken over and applied to subproblems 
of the complex case); "alternatives handling" (in a continuously changing 
environment the device develops alternative solutions, which it distinguishes 
by simple cues found in the environment); "learning from success and failure" 
(all experiences contribute); etc. 
This simulation therefore produced arguments for the realizability of this
model of "stochastic learning". However, the last point (3.) appeared less
clear, since the reinforcement, although simple, did not proceed via the
activity of the system itself: it had to fit the situations seen to what it
has saved, it had to choose actions, it was given the possibility of
recognizing impossible actions, etc.

A more natural frame for our problem is indeed provided by neural networks.
In section 3 we shall present as illustration a simulation analogous to the
one mentioned above, realized by a neural network, in which we also conform
in a clear way to point 3. above. However, in the context of neural networks
we can analyze such problems in a more systematic way, trying to achieve
results of statistical significance, and not only illustratively. In
section 2 we shall deal with a realization of our problem on perceptrons.



2 Learning rule for perceptrons under unspecific reinforcement

We consider perceptrons with Ising units $s, s_i = \pm 1$ and real weights
(synapses) $C_i$:

$$ s = \mathrm{sign}\Big( \sum_{i=1}^{N} C_i s_i \Big) = \mathrm{sign}(\mathbf{C} \cdot \mathbf{s}) \qquad (1) $$

N is the number of neurons and we put no explicit thresholds. The network
(pupil) is presented with a series of patterns $\mathbf{s}^{(q,l)}$,
q = 1,...,Q, l = 1,...,L, to which it answers with $s^{(q,l)}$. A training
period consists of the successive presentation of L patterns. The answers
are compared with the corresponding answers $t^{(q,l)}$ of a teacher with
pre-given weights $T_i$ and the average error made by the pupil over one
training period is calculated:

$$ e_q = \frac{1}{2L} \sum_{l=1}^{L} \left| t^{(q,l)} - s^{(q,l)} \right| . \qquad (2) $$

The training algorithm consists of two parts: 

I. - a "blind" Hebb-type association at each presentation of a pattern:

$$ \mathbf{C}^{(q,l+1)} = \mathbf{C}^{(q,l)} + a_1 s^{(q,l)} \mathbf{s}^{(q,l)} . \qquad (3) $$

II. - an "unspecific" reinforcement, also Hebbian, at the end of each
training period:

$$ \mathbf{C}^{(q+1,1)} = \mathbf{C}^{(q,L+1)} - a_2 e_q \sum_{l=1}^{L} s^{(q,l)} \mathbf{s}^{(q,l)} . \qquad (4) $$

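We shall call this algorithm the "2-Hebb rule". We shall be interested in
the behavior of the error e(Q) for large Q or, alternatively, in the approach
P(Q) of the pupil weights toward those of the teacher. The error is measured
on M test patterns $\mathbf{s}^{(m)}$ with teacher answers $t^{(m)}$:

$$ e(Q) = \frac{1}{2M} \sum_{m=1}^{M} \left| t^{(m)} - \mathrm{sign}\big( \mathbf{C}^{(Q,1)} \cdot \mathbf{s}^{(m)} \big) \right| . \qquad (5) $$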

Both the training patterns $\mathbf{s}^{(q,l)}$ and the test patterns
$\mathbf{s}^{(m)}$ are generated randomly. We shall test whether the
behavior of e(Q) can be reproduced by a power law at large Q:

$$ e(Q) \simeq \mathrm{const} \cdot Q^{-p} \qquad (6) $$

Notice the following features: 

a) In the training the pupil only uses its own associations
$\mathbf{s}^{(q,l)}$, $s^{(q,l)}$ and the average error $e_q$, which does
not refer specifically to the particular steps l.

b) Since the answers $s^{(q,l)}$ are made on the basis of the instantaneous
weight values, which change at each step according to eq. (3), the
series of answers forms a correlated sequence with each step depending
on the previous one. Therefore $e_q$ measures in fact the performance of
a "path", an interdependent set of decisions.

c) For L = 1 the algorithm reduces of course to the usual "perceptron
rule" (for $a_1 = 0$) or to the usual "unsupervised Hebb rule" (for
$a_2 = 2a_1$). We have on general arguments p = 1 for the first, p = 0.5
for the second case.
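
The two L = 1 limits in point c) can be checked by enumeration. For L = 1 the two parts of the algorithm combine into a single step C -> C + (a_1 - a_2 e_q) s **s**, with e_q = |t - s|/2 equal to 0 or 1. The short sketch below (function name and the amplitude value are illustrative assumptions) verifies that a_1 = 0 gives an update only on errors, in the direction of the teacher answer, and that a_2 = 2a_1 gives the same update a_1 t **s** whether or not the answer was correct.

```python
import itertools

def net_update_coefficient(t, s, a1, a2):
    """Coefficient multiplying the pattern after one L = 1 training period.

    Combines the blind Hebb step (+a1 * s) with the reinforcement
    step (-a2 * e_q * s), where e_q = |t - s| / 2.
    """
    e_q = abs(t - s) / 2
    return (a1 - a2 * e_q) * s

a = 0.3  # arbitrary illustrative amplitude

for t, s in itertools.product([-1, 1], repeat=2):
    # a1 = 0: perceptron rule, step a*t only when the pupil answer is wrong
    expected_perceptron = a * t if s != t else 0.0
    assert abs(net_update_coefficient(t, s, 0.0, a) - expected_perceptron) < 1e-12

    # a2 = 2*a1: Hebb rule, step a*t regardless of the pupil answer
    assert abs(net_update_coefficient(t, s, a, 2 * a) - a * t) < 1e-12
```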

In the following we shall present preliminary results from an ongoing project
(Kühn and Stamatescu 1998). Here $a_1$, $a_2$ are either fixed or tuned
according to the following rule:

$$ a_1(q) = \frac{a}{2} e_q \left( e_q + \frac{1}{2} \right) , \qquad a_2(q) = a\, e_q . \qquad (7) $$

Figure 1: Error vs number of training periods for N = 200, various L. For
the "perceptron" rule $a_1 = 0$, for the "2-Hebb" rule $a_1/a_2$: 0.2 - 0.25.
The "tuning" is given by eq. (7) with a = 0.1/N.

We use M = 10000 and Q up to $8 \cdot 10^5$. We tested various combinations of
L = 1, 5, 10, 15 and N = 50, 100, 200, 300. The general observations are:

- For fixed $a_1$, $a_2$ and for L of 5 and higher there is a rather narrow
region of ratios $r = a_1/a_2$, namely r: 0.2 - 0.25, for which we have
convergence; in particular we find no convergence for $a_1 = 0$. See Fig. 1.
For L = 1 varying $a_1$ interpolates between perceptron and Hebbian
learning; we did not perform a systematic analysis for L = 1, however.

- The asymptotic behavior with Q seems to be well reproduced by a
power law. For fixed $a_1$, $a_2$ the exponent appears to depend on L;
however, we may not yet have achieved convergence even for N = 300.
See Figs. 2 and 3. For $a_1$, $a_2$ tuned according to eq. (7) the
asymptotic behavior approaches the simple perceptron (p = 1). See Fig. 4.
The exponent results are summarized in Table 1.
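
One common way to extract an exponent p from the power law ansatz of eq. (6) is a straight-line fit of log e against log Q, restricted to the large-Q regime. The sketch below is an assumption about how such a fit could be done, not a description of the procedure used for Table 1; the function name, the cutoff `q_min` and the synthetic data are illustrative only.

```python
import numpy as np

def fit_power_law(Q, e, q_min):
    """Least-squares fit of log e = log(const) - p * log Q for Q >= q_min.

    Returns the pair (p, const).
    """
    Q = np.asarray(Q, dtype=float)
    e = np.asarray(e, dtype=float)
    mask = (Q >= q_min) & (e > 0)          # keep the asymptotic, non-zero points
    slope, intercept = np.polyfit(np.log(Q[mask]), np.log(e[mask]), 1)
    return -slope, float(np.exp(intercept))

# Synthetic data following an exact power law, for illustration only.
Q = np.logspace(2, 5, 40)
e = 3.0 * Q ** -0.5
p, const = fit_power_law(Q, e, q_min=1e3)
```

On real simulation data the fitted value depends on where the "asymptotic" regime is taken to start, so the cutoff should be varied to check stability.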

P(Q) behaves very similarly to e(Q). Introducing thresholds or noise changes
the particular results but does not seem to modify the general picture. Using
smooth response functions also appears not to modify this picture, at least
as long as the non-linear character is preserved. A more detailed analysis
will be presented elsewhere.

Figure 2: Error vs number of training periods for N = 200, L = 10, "2-Hebb"
rule with $a_1 = 0.01/N$, $a_2 = 0.05/N$; p = 0.33. The right hand plot gives
directly the logarithms in the "asymptotic" regime.





                                       L = 15 (tuned)    L = 15 (tuned)
  N     L = 5     L = 10    L = 15     Q_max = 400000    Q_max = 800000

  50    0.42(4)   0.23(1)   0.14(1)    0.84(2)           -
 100    0.43(3)   0.27(1)   0.18(2)    0.80(3)           0.93(4)
 200    0.52(4)   0.30(2)   0.19(1)    0.65(5)           -
 300    -         -         0.18(1)    0.60(5)           -

Table 1: Exponents of the power law ansatz for the error dependence on the
number of training steps.



Attempting to understand the above results we use a coarse grained de- 
scription of the training process by introducing a mean weight vector U(q) 
averaging over a learning period. For the coarse grained vector U(q) the 
updating reads: 

$$ \mathbf{U}(q+1) = \mathbf{U}(q) + (a_1 - a_2 e_q) \sum_{l=1}^{L} s^{(q,l)} \mathbf{s}^{(q,l)} \qquad (8) $$

Figure 3: Error vs number of training steps for N = 100, L = 15, "2-Hebb"
rule with $a_1 = 0.01/N$, $a_2 = 0.05/N$; p = 0.168. The left hand plot
presents also the convergence of the weight ray - see eq. (5). The right
hand plot gives directly the logarithms in the "asymptotic" regime.



Taking the scalar product with $\mathbf{s}^{(q',l')}$ and using the
randomness of the patterns we obtain after further approximations

$$ z(q+1) - z(q) = -c \left( a_1 - \frac{a_2}{2} z(q) \right) (1 - z(q)) \qquad (9) $$

where c is some positive constant and z(q) is the coarse grained quantity
associated to

$$ x^{(q,l)} = \left( t^{(q,l)} - s^{(q,l)} \right) t^{(q,l)} \in \{0, 2\} . \qquad (10) $$

From eq. (9) we see that asymptotically convergence requires

$$ a_1 > \frac{a_2}{2} z(q) \quad \text{for } z(q) < 1 , \qquad (11) $$

while at the beginning we may need

$$ a_1 < \frac{a_2}{2} z(q) \quad \text{for } z(q) > 1 . \qquad (12) $$

This very rough analysis (which can only serve as orientation) seems to agree 
unexpectedly well with the numerical results. The tuning eq. (7) has been 
chosen in agreement with eq. (11). 
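
The sign conditions (11) and (12) can be read off numerically from the coarse grained update. The sketch below assumes that eq. (9) has the form z(q+1) - z(q) = -c (a_1 - (a_2/2) z(q)) (1 - z(q)); this reading, the function names and the sample values are assumptions made for illustration. It simply checks for which z the right-hand side is negative, that is, for which z the error decreases.

```python
def delta_z(z, a1, a2, c=1.0):
    """Assumed right-hand side of eq. (9): -c * (a1 - a2*z/2) * (1 - z)."""
    return -c * (a1 - 0.5 * a2 * z) * (1.0 - z)

def error_decreases(z, a1, a2):
    """True if one coarse grained step lowers z for these amplitudes."""
    return delta_z(z, a1, a2) < 0

# Illustrative fixed amplitudes with ratio a1/a2 = 0.2.
a1, a2 = 0.01, 0.05

# Scan z over (0, 2), skipping the point z = 1 where the step vanishes.
for z in [0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 1.5, 1.9]:
    if z < 1:
        condition = a1 > 0.5 * a2 * z      # eq. (11)
    else:
        condition = a1 < 0.5 * a2 * z      # eq. (12)
    # The step is downward exactly when the corresponding condition holds.
    assert error_decreases(z, a1, a2) == condition
```

Under this reading the step also vanishes at z = 2 a_1 / a_2, so for fixed amplitudes the rough analysis predicts a decrease toward zero only once z has dropped below that value.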

Figure 4: Error vs number of training periods for N = 100, L = 15, "tuned
2-Hebb" rule with a = 0.1/N, see eq. (7). The "asymptotic" exponent is
p = 0.93. The left hand plot presents also the convergence of the weight
ray - see eq. (5). The right hand plot gives directly the logarithms in the
"asymptotic" regime.

The following simple observation may help to understand these findings.
For L = 1, $e_q$ can only take the values 0 and 1, and $a_1 = 0$ means
penalty for failure and no change for success, i.e. the usual perceptron
learning rule, which is known to converge. However, for L > 1, $e_q$ can
take fractional values in the interval [0, 1]. In this case $a_1 = 0$ means
a penalty for all answers which are short of perfect. A special case may be
L = 2: then there is only one intermediate value, $e_q = 0.5$, meaning
"undecided", and putting a penalty on it does not destabilize the system.

3 A stochastic learning model in a (slightly) more complex situation

In partial analogy with the model described in Mlodinow and Stamatescu 
(1985) we consider a "robot" moving on a board with obstacles under the 
following conditions: 

1. The board is realized as a grid and the robot can look and move one
step in each of the 4 directions (forth, right, back and left).

Figure 5: Architecture of the robot (the figure labels the output ("motor")
neurons).



2. The robot's architecture consists of an input layer of 5 x 4 Ising
neurons and an output layer of 4 ("motor") neurons, each responsible for
moving in one direction. See Fig. 5. The modifiable synapses are
unidirectional from all the input neurons to the output ones (120 weights).
The output neurons are assumed to be interconnected such as to ensure
a "winner takes all" reaction, with the winner being decided
probabilistically on the basis of the height of the presynaptic potentials
(see footnote 1).

3. The immediate input is the state of the 4 neighboring cells (free or
occupied); this information is always loaded in the first 4 input
neurons and transferred to the next group of 4 at the next step. Hence
the robot has at each step as input the situations it has seen at present
and in the last 4 steps. For this input the presynaptic potentials of the
motor neurons are calculated and a move is attempted.

4. The updating of the weights is achieved in 2 stages: 

a. - At each step a "blind" Hebb potentiation/inhibition is performed,
considering only the actual states of the input and output neurons. If
the intended move runs against an obstacle, the state of the output
neuron responsible for this move is reversed before the Hebb updating
is done.

Footnote 1: The interconnection in the output layer has not been realized in
the architecture and the corresponding reaction has been put in by hand,
since this question did not belong to the ones we wanted to study; it is
supposed that the corresponding interconnection could be realized.

b. - The robot starts from some given cell on the bottom of the board
and moves. It stops if it arrives at the upper line of the board or if
it has made some predefined, maximal number of steps. The number
of steps at stopping is compared with a predefined success margin and
the difference is used to Hebb-inhibit/potentiate equally all synaptic
updates it has performed on the way (step a.).

5. The weights are not normalized; the various parameters (noise, thresh- 
olds, the amplitudes of the two updates) are given by best guess - in a 
further development they may be left to the network itself to optimize 
(see Mlodinow and Stamatescu 1985). 
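
The five conditions above can be sketched in code as follows. This is a minimal illustration under explicit assumptions and not the simulation used for the figures: the softmax choice of the winning motor neuron, the `temperature` parameter, the treatment of the board edge as an obstacle, the normalization of the final reinforcement by `max_steps`, all function names and all default values are hypothetical. The sketch uses a plain 4 x 20 weight matrix, so its weight count differs from the 120 quoted in the text.

```python
import numpy as np

MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # forth (up), right, back, left

def blocked(board, r, c):
    """True if (r, c) is outside the board or holds an obstacle."""
    rows, cols = board.shape
    return not (0 <= r < rows and 0 <= c < cols) or bool(board[r, c])

def look(board, pos):
    """Ising state of the 4 neighbouring cells: +1 occupied, -1 free."""
    return np.array([1.0 if blocked(board, pos[0] + dr, pos[1] + dc) else -1.0
                     for dr, dc in MOVES])

def run_episode(W, board, start, rng, a1=0.01, a2=0.05,
                max_steps=60, margin=20, temperature=1.0):
    """One run from `start` to the top row or to max_steps; updates W in place."""
    pos = start
    inputs = -np.ones(20)             # 5 groups of 4: present view + last 4 views
    trace = np.zeros_like(W)          # sum of all blind Hebb terms of this run
    steps = 0
    while pos[0] > 0 and steps < max_steps:
        # load the present view into the first group, shift the older ones
        inputs = np.concatenate([look(board, pos), inputs[:-4]])
        h = W @ inputs                                 # motor potentials
        p = np.exp((h - h.max()) / temperature)
        p /= p.sum()
        k = rng.choice(4, p=p)                         # probabilistic winner (assumed softmax)
        out = -np.ones(4)
        out[k] = 1.0
        target = (pos[0] + MOVES[k][0], pos[1] + MOVES[k][1])
        if blocked(board, *target):
            out[k] = -1.0                              # reverse the motor neuron of the failed move
        else:
            pos = target
        hebb = np.outer(out, inputs)
        W += a1 * hebb                                 # stage a: blind Hebb step
        trace += hebb
        steps += 1
    # stage b: one global signal applied equally to all updates of the run
    W += a2 * (margin - steps) / max_steps * trace
    return steps, pos
```

A run is started with, for example, `run_episode(np.zeros((4, 20)), board, (rows - 1, col), np.random.default_rng(0))`, where `board` is a boolean array with `True` marking obstacles; repeated calls on the same `W` correspond to successive training runs.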

The performance of the robot is illustrated in Figs. 6 and 7.

4 Conclusions 

The results described in section 2 indicate that the approach taken here can
be analyzed systematically and shows universal features. Notice that the
parameter $a_1$ is of special significance: on the one hand it gives the
interdependency between the particular steps during the learning period and
on the other hand it turns out to be useful or even necessary for the
convergence of the training. The ratio between the "blind association"
parameter $a_1$ and the "unspecific reinforcement" parameter $a_2$ does not
appear arbitrary.

The illustration provided by the simulation of section 3 shows that the 
elementary, primitive "stochastic learning" considered here may cope also 
with more "differentiated" situations. This suggests that such mechanisms 
may well be of a rather fundamental nature. One can also ask about the 
possibility of developing on this basis strategies for more evolved artificial 
intelligence systems. 

Figure 6: Typical results for 6 tests on which the robot is trained
consecutively (from up to down). From left to right: first run, early
behavior, late behavior (including transients or alternatives).

Figure 7: Performances for the 6 tests of Fig. 6. The full line is the
histogram of the total number of steps, including repeatedly running against
obstacles (counted explicitly by the lower histogram - dotted line). The
parameters are not optimized.






